Very large language resources? At our finger!

Dan Cristea
“Alexandru Ioan Cuza” University of Iasi
16, Berthelot St., 700483 Iasi, Romania
E-mail: dcristea@info.uaic.ro

Abstract

The paper proposes a legislative initiative for acquiring large-scale language resources. It argues for a broad awareness campaign that would allow the storing and preservation, for research purposes and in electronic form, of all textual documents that go to print in a country.

1 Introduction

This paper brings to attention a proposal for conserving over the long term, and using widely at a national level for research purposes, the linguistic data which are printed and distributed for public use daily by editorial houses.

It is evident that, without a continuous effort, those languages which are now called “less-resourced” will continue to be viewed as such even when, hypothetically, they reach the same amount of resources as the languages that at this very moment are known to be the most resourced. Moreover, if the most resourced languages ceased to acquire resources now, on the ground that they have fulfilled their needs, they would soon lose their leading positions. This is because LRs become obsolete very quickly. Even more, if we look at annotated resources, the linguistic facts which are subject to automatic annotation can change over time, as the linguistic theories on which the marking conventions are based evolve, and as the automatic annotation processes themselves improve. So, as the language evolves and our vision of the language changes, the resources themselves get old. There is no end to building LRs.

In many countries a “legal deposit” law is in use. It obliges all providers of printed materials (editing houses, physical or juridical persons which print documents for the public, recording houses and studios, the National Bank, the State Mint, the National Post, etc.), let’s call them resourcers, to send a number of copies of each printed item intended for distribution to a national library (which could be one physical unit or a consortium of libraries) for long-time preservation. Although the horizon of media production has changed dramatically in recent years, to my knowledge there have been only very timid attempts to improve the juridical aspects.

As resources are badly needed and many of them are very expensive, acquiring them should stop being accidental or episodic and should become a national policy. Something should be done. A law should defend the linguistic resources of the languages spoken in a country as being of primary interest. This paper discusses one possible solution which, although not simple to implement, could completely change the LRs scene in the near future.

2 Enhancing the legislation on legal deposits

A recent investigation among some of the most important producers of printed information in Romania revealed that many editing houses are keen to donate their resources for research purposes. However, another fraction, which unfortunately makes up the majority, is not interested in collaborating. They ignore the importance of the issue, fear that donating their data is equivalent to losing property control over them or will possibly trigger a loss of profit, or simply do not have time to dedicate to this kind of matter.

In reality, nothing of the kind has to happen. Although we need their linguistic data, we do not want the resourcers to be harmed if they give their data to science. The idea is to promote a legislative initiative that makes it compulsory for the resourcers to donate their linguistic data for language research. The proper moment has come to try to raise awareness for a concerted action in Europe. We need to raise governmental interest in the promotion of such legislation, simultaneously in many countries.

The following types of resources, produced in series, would be in the focus of such a law, irrespective of whether they are intended for commercial or for free distribution: books, booklets, leaflets, journals, magazines, almanacs, calendars, musical scores, propagandistic materials having a political, administrative, cultural, artistic, scientific, educational, religious or similar goal, posters, proclamations, any other materials intended for publication in public places, Ph.D. theses, university courses, documents in electronic format containing linguistic material (CDs, DVDs, etc.), standards and technical norms, publications issued by national and local authorities, collections of norms and laws, and any other material printed or multiplied by graphical or physical-chemical methods.

On the practical level, the initiative presupposes the existence of a national repository, which is an entity (IT center, institute, etc.), let’s call it the Portal, which, on the one hand, has the legal authority to receive and store data contributed by resourcers and, on the other hand, is technically equipped to collect and record indefinitely, in electronic format, all data issued for publication, daily, in a country.

The law should state that by sending an electronic copy for long-time preservation to this national repository no authoring rights or commercial benefits are lost by the Resourcer. The copy can be used, intermediated by the Portal, only for research purposes applied to language, and the Portal cannot make the data public on the internet or on other media unless it is asked to do so by the owner. It is clear that a fragile IPR chapter will not be acceptable in the text of this law (COM, 2009). A weak statement of the IT security measures that protect the authors’ rights will also be open to amendment. All these aspects are very important and should receive full attention in the formulation of this law.

3 The capturing flow

I see the Portal as a factory that processes words. The starting elements of the data flow should be as follows: before issuing its first publication, or at the moment the law comes into force, the Resourcer should have obtained an identification code (RID) from the Portal. It will use this code for communication with the Portal regarding any publication, during its entire juridical lifetime.

Suppose that today the Resourcer prepares for publication a new item I, which has received the “ready for printing” editorial approval. Figure 1 a explains the communication initiated by this new item. The Resourcer fills in an electronic form (header, H), containing identification information of the document, and then interacts with the Portal, uploading its RID, the header H and an editable copy of I. The Portal receives this data and asks for a persistent identification code (PID) from an authority capable of issuing one (Kunze and Rogers, 2003; Schwardmann, 2009). When it gets one, it stores in its repository a bundle of data containing PID, RID, H, and I. Then the Portal returns to the Resourcer an OK message, containing two parts: a human-readable part and a bar code part. The OK box should record a seal of the Portal, together with the PID and the RID. Now the document, which also carries this OK box on an inner cover or on a sleeve, can be printed (Figure 1 b). This box proves to any authority in charge of controlling the application of the law that the legal deposit was performed by the Resourcer on the Portal, and all the needed identification information is there.

[Figure 1 a: the exchange among the Resourcer (RID), the Portal, the PID provider and the Repository, in numbered steps; labels: Header (H), I (document), RID + H + I, PID request, PID, PID + RID + H + I, OK. Figure 1 b: the Resourcer (RID) and the document I, final step.]

The exchange of data detailed above between the Resourcer and the Portal, which also includes a communication with a third entity responsible for issuing PIDs, seems heavy and time consuming and, if so, totally unacceptable to the editing houses. Indeed, it is a known fact that these entities are most of the time constrained to process data at great speed, especially if they print daily newspapers, for example. Nevertheless, the communication which, as described above, appears to be heavy and cumbersome can be done in the blink of an eye by making the whole chain completely automatic, including the fill-in of the identification information contained in the header H. The content of the header can be extracted by specialized modules from the electronic item I. So, practically, the entire chain could be activated by a click on a button of the editing interface. This should end, almost instantly, with the inclusion of the OK box in a dedicated place of the document going to print.

4 Data processing

Once captured, data on the Portal should be processed. In this section I describe a list of processing capabilities that the Portal should be able to provide.

First, it is obvious that the Portal should have sufficient storage capacity and that this capacity should be specially designed for preserving data over indefinitely long periods of time. Then it should offer indexing, search and retrieval capacities at different levels: header, lexical tokens (words), lexical expressions, as well as contextual information. This means that each document, once placed on the Portal, should be submitted to a processing chain that includes, minimally: tokenization, part-of-speech tagging, lemmatization and indexing. It is therefore foreseeable that each document will be recorded as raw text to which the standoff XML annotation will make reference. The XML annotation and the indexing requirements will most probably multiply the size of the initial text documents a couple of times.

Based on these basic functionalities, a different line of processing refers to lexicographic needs. The Portal should be able to perform complex operations such as: detection of foreign words, signaling of new words, recognition of senses of words in context (WSD), detection of new senses, signaling of forgotten (obsolete) words, signaling of senses which are no longer used, etc. For instance, signaling of new words and of forgotten (obsolete) words should be triggered by a frequency of occurrence which, over a given interval of time, is above/below certain thresholds, as decided by a linguistic authority. Similarly, signaling of a new sense could be triggered by the failure to align the sense recognized in context to those kept in a repository of senses, for instance an authoritative explanatory dictionary, if this has happened with a certain frequency recently and if the pattern of use is sufficiently stable. Forgotten (obsolete) senses are recognized by the occurrence of these senses falling under a certain threshold.

The process which should be placed at the base of recognizing obsolete words or senses presupposes placing a bag of words under constant surveillance. These are words/senses likely to become under-used because they experience a constantly degrading frequency. Let’s note that the criterion of absolute or even relative frequency over a certain interval could prove not to be relevant, because there are words which are very rarely used although they are in no danger of being considered extinct (some scientific neologisms, for instance). The best way to do this is to associate to each word a personal file, recording a set of dynamic features, among which the frequency of occurrence over time (a graph, from which a gradient of deterioration could be computed), the list of registers that use it (with the associated relative frequencies), etc. So, the problem resides in computing the frequency over a constant interval of time, always considered back from the current day. One could do this by simply searching for the spotted word in the repository and counting only the occurrences that fall in the needed interval, a function that would be called only once in a certain long interval, say two to five years (because one cannot expect that the tagging “obsolete” can be updated too frequently, from yesterday to today…).

It is clear that any decision on any one of these positions should ultimately be taken by a linguistic authority (the Academy). Its decisions should examine the signals transmitted by the Portal, which are rooted in clean statistical evidence.

Different processing flows could implement other functions. A number of resources, which are of increasing importance in keeping a language technologically updated, can be continuously connected to the Portal. Among these, I see: the main Dictionary of the language, the WordNet (Fellbaum, 1998), the VerbNet (Kipper et al., 2008), the FrameNet (Fillmore, 1976; Atkins et al., 2003), to name just a few. Supposing all these resources are complete for the language L at a certain moment, they should be kept updated with the evolution of the language. So, any dynamics in language should be mirrored in these resources as well. If, as suggested above, each lexical item of the language has a personal record on the Portal, then it should include references into all these resources. As such, the word w is linked to its entry in the Dictionary, where the inventory of senses is recorded, and these senses are aligned to those listed in the WordNet for this lexical item, as well as to its entries in VerbNet and FrameNet. All these resources are connected with one another and kept in line with the evolution of language by the Portal.

The Portal can also host a number of services addressed to the resourcers, to language researchers, to consumers or to the public at large. Public services could be charged to the customers and the benefits returned to the resourcers, in amounts proportional to their monthly contribution to the Portal (measured in characters). Other types of paid services could be imagined, with benefits returned to the resourcers, for instance advertising publications and on-line access to parts of their publications which they are keen to offer on the market. The possibility of developing a set of services from which the resourcers could obtain profit is also interesting from the point of view of potentially lowering the resourcers’ opposition to a law that would impose the obligation of continuous language preservation, as discussed in section 2.

5 Evaluation

It is clear that the type of processing entailed by such an initiative would bring to the Portal a very large amount of linguistic data daily. A rough evaluation of the processing needs and costs entailed by such a nation-wide enterprise should bring into focus parameters such as: the number of editorial houses registered, the average number of publications of a publishing house per year, the average length in pages of a printed item, and the average number of characters per page.

Leaving aside episodic publications of small size, our enquiry into the average amount of data published in books and journals in a medium-size country of Europe (Romania), at the level of the year 2008, has yielded an amount of textual data which is less than 1 Gb daily. A channel with a bandwidth of 12.5 Mb/sec can easily handle the required transfer described in section 3, avoiding bottlenecks at peak moments. Load balancing and mirroring, for safety reasons, should be assured by storing the data in at least two centers, in different locations. As already proved by data-intensive storing houses (Google [1], for instance), software RAID technology, made up of a farm of small computers, is a cheap and appropriate solution for long-time preservation and a comfortable processing speed.

6 Conclusions

The advantages of a Portal able to process linguistic data at a scale such as the one envisioned above are hard to depict correctly now. First of all, it will give a long-term and complete solution to the problem of linguistic data preservation for the language(s) of a nation, as well as an almost complete radiography of its diachronic evolution. Secondly, it will lay the basis for exhaustive research related to language. Thirdly, it could bring into focus a large range of commercially appealing applications, to the benefit of the authors of the texts or the resourcers.

The success of such an initiative at national level depends very much on a large, concerted vision. The new and very fresh breath that is being felt at this moment in Europe with respect to building language processing infrastructures, establishing standards for the representation of linguistic data, and fostering large-scale initiatives for the acquisition of linguistic resources, as driven by recent consortiums like CLARIN [2], FlareNet [3], T4Me [4], Meta-Net [5], etc., should also move forward a favorable legislation. The proposal advanced in this paper is also in line with other initiatives that try to raise awareness of the necessity of free access to science [6]. It does not, however, advocate against intellectual property (Stephan, 2001), but is very much in favor of a reconsideration of the IPR legislation, which is too restrictive in many cases of usage of language resources for research. After all, our language, as we use it today, represents a collective contribution and is due to a perpetual reshaping by all its speakers from the beginning of time… Donating one’s linguistic creation for language preservation and research, while not harming its creator at all, either intellectually or commercially, represents just the minimum return that an author who uses the language owes to those who have invented it, for the benefit of those who will use it in the future.

7 References

Atkins, S., Rundell, M. and Sato, H. (2003) The Contribution of FrameNet to Practical Lexicography, International Journal of Lexicography, Volume 16.3: 333-357.
COM (2009) 532. Communication from the Commission: Copyright in the Knowledge Economy.
Fellbaum, C. (1998) WordNet: An Electronic Lexical Database, MIT Press.
Fillmore, C. J. (1976) Frame semantics and the nature of language. In Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech, Volume 280: 20-32.
Kipper, K., Korhonen, A., Ryant, N., Palmer, M. (2008) A Large-scale Classification of English Verbs, Language Resources and Evaluation Journal, 42(1), pp. 21-40, Springer Netherlands.
Kunze, J. and R. P. C. Rogers (2003) The ARK Persistent Identifier Scheme. Internet draft at http://www.cdlib.org/inside/diglib/ark/arkspec.pdf
Schwardmann, U. (2009) PID System for eResearch. EPIC, the European Persistent Identifier Consortium, personal communication at NEERI-09, Helsinki.
Stephan, K. (2001) Against Intellectual Property, Journal of Libertarian Studies 15.2 (Spring 2001): 1-53.

Footnotes

[1] http://infolab.stanford.edu/~backrub/google.html
[2] www.clarin.eu
[3] http://www.flarenet.eu/
[4] http://t4me.dfki.de/
[5] http://www.meta-net.eu/
[6] See, for instance, the Washington D.C. Principles For Free Access to Science at http://www.dcprinciples.org/statement.pdf, the Open Access initiative http://www.eprints.org/openaccess/, the American Scientist Open Access Forum http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html, The SPARC Open Access Newsletter (see an issue at http://www.earlham.edu/~peters/fos/newsletter/01-02-10.htm), the Budapest Open Access Initiative http://www.soros.org/openaccess.
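
Illustration for section 3 (the capturing flow). The exchange among the Resourcer, the Portal and the PID provider could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not a specification of the Portal: the class and function names (Portal, Deposit, extract_header), the format of the RID, and the use of a random UUID in place of a real persistent identifier service are all hypothetical and chosen for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Deposit:
    """One stored record: PID, RID, header H and document I."""
    pid: str
    rid: str
    header: dict
    document: str


@dataclass
class Portal:
    """Hypothetical Portal holding the repository of deposits."""
    seal: str = "PORTAL-SEAL"  # placeholder for the seal of the Portal
    repository: dict = field(default_factory=dict)
    resourcers: set = field(default_factory=set)

    def register(self) -> str:
        """Issue an identification code (RID) to a new Resourcer."""
        rid = f"RID-{len(self.resourcers) + 1:06d}"
        self.resourcers.add(rid)
        return rid

    def request_pid(self) -> str:
        """Stand-in for the external PID provider (assumption: a random UUID)."""
        return f"PID-{uuid.uuid4()}"

    def deposit(self, rid: str, header: dict, document: str) -> dict:
        """Receive RID + H + I, obtain a PID, store the record, return the OK box."""
        if rid not in self.resourcers:
            raise ValueError(f"unknown resourcer: {rid}")
        pid = self.request_pid()
        self.repository[pid] = Deposit(pid, rid, header, document)
        return {"seal": self.seal, "pid": pid, "rid": rid}


def extract_header(document: str) -> dict:
    """Toy header extraction: first line as title, plus the character count."""
    stripped = document.strip()
    title = stripped.splitlines()[0] if stripped else ""
    return {"title": title, "characters": len(document)}


# One click on the editing interface would run the whole chain.
portal = Portal()
rid = portal.register()
item = "Morning Gazette\nText of today's issue goes here."
ok_box = portal.deposit(rid, extract_header(item), item)
print(ok_box)
```

The returned dictionary plays the role of the OK box: it carries the seal, the PID and the RID, which is the information the paper asks to be printed on an inner cover or sleeve. The bar code rendering is left out of the sketch.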
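
Illustration for section 4 (signaling of obsolete words). The idea of counting occurrences over constant intervals taken back from the current day, and of deriving a "gradient of deterioration" from them, might look roughly like this. It is a sketch under assumptions: the function names, the choice of a least-squares slope as the gradient, and the threshold value are hypothetical; the paper leaves the actual thresholds to a linguistic authority.

```python
from datetime import date, timedelta


def windowed_counts(occurrences, today, window_days, n_windows):
    """Count occurrences per consecutive window, oldest window first.

    Windows are measured back from `today`; dates older than
    `window_days * n_windows` days, or in the future, are ignored.
    """
    counts = [0] * n_windows
    for day in occurrences:
        age = (today - day).days
        if 0 <= age < window_days * n_windows:
            counts[n_windows - 1 - age // window_days] += 1
    return counts


def slope(counts):
    """Least-squares slope of the counts against the window index."""
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den if den else 0.0


def obsolete_candidate(counts, threshold):
    """Flag a word whose latest count is below the threshold and still falling."""
    return counts[-1] < threshold and slope(counts) < 0


today = date(2010, 1, 1)
dates = (
    [today - timedelta(days=1000)] * 5   # oldest window
    + [today - timedelta(days=600)] * 3  # middle window
    + [today - timedelta(days=200)]      # most recent window
)
counts = windowed_counts(dates, today, window_days=365, n_windows=3)
print(counts, slope(counts), obsolete_candidate(counts, threshold=2))
```

Requiring a negative slope in addition to a low recent count reflects the remark that rarely used words, such as some scientific neologisms, should not be flagged merely because their frequency is low.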
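
Illustration for section 5 (evaluation). The claim that the quoted channel can carry the daily volume can be checked with a short calculation. It assumes that "Gb" and "Mb" in the text refer to the same base unit (both bytes or both bits), that decimal prefixes are meant, and it takes the stated upper bound of 1 Gb per day as the volume $V$ and 12.5 Mb/sec as the bandwidth $B$; the symbols $V$, $B$ and $t$ are introduced here only for the illustration.

```latex
t = \frac{V}{B} = \frac{1000\ \text{Mb}}{12.5\ \text{Mb/s}} = 80\ \text{s}
```

Under these assumptions the whole daily volume would need about 80 seconds of transfer time at full bandwidth, a small fraction of the 86,400 seconds in a day, which leaves ample headroom for peaks. By the same bound, the raw text would accumulate to less than 365 Gb per year, before the multiplication caused by annotation and indexing mentioned in section 4.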